USO DEL ANÁLISIS DE AGRUPAMIENTO DE PAM PARA AGRUPAR PAÍSES POR RIQUEZA, DESARROLLO, EMISIONES DE CARBONO Y FELICIDAD
TABLA DE CONTENIDOS SUGERIDO
IMPORTACIÓN DE BIBLIOTECAS
CARGA DE DATOS
LIMPIEZA DE DATOS
PREPROCESAMIENTO DE DATOS
SIEMPRE (CASI) ESCALAR LOS DATOS.
ANÁLISIS DE COMPONENTES PRINCIPALES (PCA)
USO DEL ANÁLISIS DE AGRUPAMIENTO DE PAM PARA AGRUPAR PAÍSES POR RIQUEZA, DESARROLLO, EMISIONES DE CARBONO Y FELICIDAD
UN MAPA MUNDIAL DE LOS CLÚSTERES
FIN
1. - IMPORTACION DE BIBLIOTECAS
Importar librerias
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(plotly)
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.2.2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(stringr)
library(cluster)
library(FactoMineR)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(ggplot2)
library(reshape2)
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.2.2
library(NbClust)
library(mice)
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(DataExplorer)
## Warning: package 'DataExplorer' was built under R version 4.2.2
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(ggcorrplot)
## Warning: package 'ggcorrplot' was built under R version 4.2.2
2. - CARGA DE DATOS
Cargando el conjunto de datos
# importar los datos usando la libreria "xlsx"
library(xlsx)
## Warning: package 'xlsx' was built under R version 4.2.2
hpi <- read.xlsx('data.xlsx',sheetIndex = 5, header = TRUE)
Mostrar contenido del dataset
# MOSTRAR LAS PRIMERAS FILAS
head(hpi)
## NA. NA..1 NA..2 NA..3
## 1 NA Complete dataset <NA> <NA>
## 2 NA <NA> <NA> <NA>
## 3 NA <NA> <NA> <NA>
## 4 NA <NA> <NA> <NA>
## 5 NA HPI Rank Country Region
## 6 NA 110 Afghanistan Middle East and North Africa
## NA..4 NA..5 NA..6
## 1 <NA> <NA> <NA>
## 2 <NA> <NA> <NA>
## 3 <NA> <NA> <NA>
## 4 <NA> <NA> <NA>
## 5 Average Life \nExpectancy Average Wellbeing\n(0-10) Happy Life Years
## 6 59.668 3.8 12.3960238087402
## NA..7 NA..8
## 1 <NA> <NA>
## 2 <NA> <NA>
## 3 <NA> <NA>
## 4 <NA> <NA>
## 5 Footprint\n(gha/capita) Inequality of Outcomes
## 6 0.79 0.426557441321005
## NA..9 NA..10
## 1 <NA> <NA>
## 2 <NA> <NA>
## 3 <NA> <NA>
## 4 <NA> <NA>
## 5 Inequality-adjusted Life Expectancy Inequality-adjusted Wellbeing
## 6 38.3488176249152 3.39049431767374
## NA..11 NA..12 NA..13 NA..14 NA..15
## 1 <NA> <NA> <NA> <NA> NA
## 2 <NA> <NA> <NA> <NA> NA
## 3 <NA> <NA> <NA> <NA> NA
## 4 <NA> <NA> <NA> <NA> NA
## 5 Happy Planet Index GDP/capita\n($PPP) Population GINI index NA
## 6 20.2253497709571 690.842629014956 29726803 Data unavailable NA
## NA..16 NA..17 NA..18 NA..19 NA..20 NA..21 NA..22 NA..23 NA..24 NA..25 NA..26
## 1 NA NA NA NA NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA NA NA NA NA
## 3 NA NA NA NA NA NA NA NA NA NA NA
## 4 NA NA NA NA NA NA NA NA NA NA NA
## 5 NA NA NA NA NA NA NA NA NA NA NA
## 6 NA NA NA NA NA NA NA NA NA NA NA
## NA..27 NA..28 NA..29 NA..30 NA..31 NA..32 NA..33 NA..34 NA..35 NA..36 NA..37
## 1 NA NA NA NA NA NA NA NA NA NA NA
## 2 NA NA NA NA NA NA NA NA NA NA NA
## 3 NA NA NA NA NA NA NA NA NA NA NA
## 4 NA NA NA NA NA NA NA NA NA NA NA
## 5 NA NA NA NA NA NA NA NA NA NA NA
## 6 NA NA NA NA NA NA NA NA NA NA NA
## NA..38 NA..39 NA..40 NA..41
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
# MOSTRAR LAS ULTIMAS FILAS
tail(hpi)
## NA. NA..1
## 158 NA <NA>
## 159 NA <NA>
## 160 NA <NA>
## 161 NA <NA>
## 162 NA <NA>
## 163 NA <NA>
## NA..2
## 158 <NA>
## 159 GDP/capita \n($PPP)
## 160 Population
## 161 GINI
## 162 For further information about the data sources and calculation methodology, please see the Methodology paper
## 163 <NA>
## NA..3
## 158 <NA>
## 159 2012 data (or nearest available year) World Development Indicators, The World Bank
## 160 2012 data (or nearest available year) World Development Indicators, The World Bank
## 161 2012 data (or nearest available year) World Development Indicators, The World Bank
## 162 <NA>
## 163 <NA>
## NA..4 NA..5 NA..6 NA..7 NA..8 NA..9 NA..10 NA..11 NA..12 NA..13 NA..14
## 158 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 159 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 160 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 161 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 162 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 163 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## NA..15 NA..16 NA..17 NA..18 NA..19 NA..20 NA..21 NA..22 NA..23 NA..24
## 158 NA NA NA NA NA NA NA NA NA NA
## 159 NA NA NA NA NA NA NA NA NA NA
## 160 NA NA NA NA NA NA NA NA NA NA
## 161 NA NA NA NA NA NA NA NA NA NA
## 162 NA NA NA NA NA NA NA NA NA NA
## 163 NA NA NA NA NA NA NA NA NA NA
## NA..25 NA..26 NA..27 NA..28 NA..29 NA..30 NA..31 NA..32 NA..33 NA..34
## 158 NA NA NA NA NA NA NA NA NA NA
## 159 NA NA NA NA NA NA NA NA NA NA
## 160 NA NA NA NA NA NA NA NA NA NA
## 161 NA NA NA NA NA NA NA NA NA NA
## 162 NA NA NA NA NA NA NA NA NA NA
## 163 NA NA NA NA NA NA NA NA NA NA
## NA..35 NA..36 NA..37 NA..38 NA..39 NA..40 NA..41
## 158 NA NA NA NA NA NA NA
## 159 NA NA NA NA NA NA NA
## 160 NA NA NA NA NA NA NA
## 161 NA NA NA NA NA NA NA
## 162 NA NA NA NA NA NA NA
## 163 NA NA NA NA NA NA NA
- Se observa que hay presencia de datos faltantes tanto en filas y columnas, asi como también e información poco relevante
3. - LIMPIEZA DE DATOS
identificacion y tratamiento de datos nulos
# Mostrar la cantidad de datos nulos por cada columna
colSums(is.na(hpi))
## NA. NA..1 NA..2 NA..3 NA..4 NA..5 NA..6 NA..7 NA..8 NA..9 NA..10
## 163 21 13 15 22 22 22 22 22 22 22
## NA..11 NA..12 NA..13 NA..14 NA..15 NA..16 NA..17 NA..18 NA..19 NA..20 NA..21
## 22 22 22 22 163 163 163 163 163 163 163
## NA..22 NA..23 NA..24 NA..25 NA..26 NA..27 NA..28 NA..29 NA..30 NA..31 NA..32
## 163 163 163 163 163 163 163 163 163 163 163
## NA..33 NA..34 NA..35 NA..36 NA..37 NA..38 NA..39 NA..40 NA..41
## 163 163 163 163 163 163 163 163 163
# Mostrar % de datos nulos por cada columna
porcentajeMiss <- function(x) {sum(is.na(x)) / length(x)*100}
apply(hpi, 2, porcentajeMiss)
## NA. NA..1 NA..2 NA..3 NA..4 NA..5 NA..6
## 100.000000 12.883436 7.975460 9.202454 13.496933 13.496933 13.496933
## NA..7 NA..8 NA..9 NA..10 NA..11 NA..12 NA..13
## 13.496933 13.496933 13.496933 13.496933 13.496933 13.496933 13.496933
## NA..14 NA..15 NA..16 NA..17 NA..18 NA..19 NA..20
## 13.496933 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000
## NA..21 NA..22 NA..23 NA..24 NA..25 NA..26 NA..27
## 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000
## NA..28 NA..29 NA..30 NA..31 NA..32 NA..33 NA..34
## 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000
## NA..35 NA..36 NA..37 NA..38 NA..39 NA..40 NA..41
## 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000 100.000000
-Se observa que desde la columna “NA..15” en adelante presentan un 100% de valores nulos, los cuales tendremos que eliminar ya que no poseen ningún dato, al igual que la columna “NA.”
Eliminando contenido poco relevante
# Eliminando columnas con datos nulos
hpi_temp <- hpi[,-c(1,16:42)]
head(hpi_temp)
## NA..1 NA..2 NA..3
## 1 Complete dataset <NA> <NA>
## 2 <NA> <NA> <NA>
## 3 <NA> <NA> <NA>
## 4 <NA> <NA> <NA>
## 5 HPI Rank Country Region
## 6 110 Afghanistan Middle East and North Africa
## NA..4 NA..5 NA..6
## 1 <NA> <NA> <NA>
## 2 <NA> <NA> <NA>
## 3 <NA> <NA> <NA>
## 4 <NA> <NA> <NA>
## 5 Average Life \nExpectancy Average Wellbeing\n(0-10) Happy Life Years
## 6 59.668 3.8 12.3960238087402
## NA..7 NA..8
## 1 <NA> <NA>
## 2 <NA> <NA>
## 3 <NA> <NA>
## 4 <NA> <NA>
## 5 Footprint\n(gha/capita) Inequality of Outcomes
## 6 0.79 0.426557441321005
## NA..9 NA..10
## 1 <NA> <NA>
## 2 <NA> <NA>
## 3 <NA> <NA>
## 4 <NA> <NA>
## 5 Inequality-adjusted Life Expectancy Inequality-adjusted Wellbeing
## 6 38.3488176249152 3.39049431767374
## NA..11 NA..12 NA..13 NA..14
## 1 <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA>
## 5 Happy Planet Index GDP/capita\n($PPP) Population GINI index
## 6 20.2253497709571 690.842629014956 29726803 Data unavailable
- Observando el archivo excel se detectó que hay datos nulos e innecesarios al inicio y al final del dataset
# Eliminando filas con datos nulos y contenido no relevante
hpi_clean <- hpi_temp[-c(1:5,146:163),]
head(hpi_clean)
## NA..1 NA..2 NA..3 NA..4 NA..5 NA..6
## 6 110 Afghanistan Middle East and North Africa 59.668 3.8 12.3960238087402
## 7 13 Albania Post-communist 77.347 5.5 34.4147360108721
## 8 30 Algeria Middle East and North Africa 74.313 5.6 30.4694613112307
## 9 19 Argentina Americas 75.927 6.5 40.1666738745799
## 10 73 Armenia Post-communist 74.446 4.3 24.0187600607023
## 11 105 Australia Asia Pacific 82.052 7.2 53.0694977095264
## NA..7 NA..8 NA..9 NA..10 NA..11
## 6 0.79 0.426557441321005 38.3488176249152 3.39049431767374 20.2253497709571
## 7 2.21 0.165133717343957 69.6711592798365 5.09765040048744 36.76687443911
## 8 2.12 0.244861747889588 60.4745445936025 5.1964491855983 33.3005427886154
## 9 3.14 0.164238301077711 68.3495831090017 6.03470703959787 35.1902436422946
## 10 2.23 0.216648103296311 66.9216823230103 3.74713959801918 25.6664172884369
## 11 9.31 0.0806782476710599 78.5600237730927 6.9133349323794 21.2289663082594
## NA..12 NA..13 NA..14
## 6 690.842629014956 29726803 Data unavailable
## 7 4247.48543681502 2900489 28.96
## 8 5583.61615950131 37439427 Data unavailable
## 9 14357.4115893903 42095224 42.49
## 10 3565.5175749254 2978339 30.48
## 11 67646.1038529626 22728254 Data unavailable
Renombrando las columnas
names(hpi_clean) <- c("HPI_Rank", "Country", "Region", "Average_Life_Expectancy", "Average_Wellbeing", "Happy_Life_Years", "Footprint", "Inequality_of_Outcomes", "Inequality_adjusted_Life_Expectancy", "Inequality_adjusted_Wellbeing", "Happy_Planet_Index", "GDP_capita", "Population", "GINI_index")
rownames(hpi_clean) <- NULL # resetear los indices
head(hpi_clean)
## HPI_Rank Country Region Average_Life_Expectancy
## 1 110 Afghanistan Middle East and North Africa 59.668
## 2 13 Albania Post-communist 77.347
## 3 30 Algeria Middle East and North Africa 74.313
## 4 19 Argentina Americas 75.927
## 5 73 Armenia Post-communist 74.446
## 6 105 Australia Asia Pacific 82.052
## Average_Wellbeing Happy_Life_Years Footprint Inequality_of_Outcomes
## 1 3.8 12.3960238087402 0.79 0.426557441321005
## 2 5.5 34.4147360108721 2.21 0.165133717343957
## 3 5.6 30.4694613112307 2.12 0.244861747889588
## 4 6.5 40.1666738745799 3.14 0.164238301077711
## 5 4.3 24.0187600607023 2.23 0.216648103296311
## 6 7.2 53.0694977095264 9.31 0.0806782476710599
## Inequality_adjusted_Life_Expectancy Inequality_adjusted_Wellbeing
## 1 38.3488176249152 3.39049431767374
## 2 69.6711592798365 5.09765040048744
## 3 60.4745445936025 5.1964491855983
## 4 68.3495831090017 6.03470703959787
## 5 66.9216823230103 3.74713959801918
## 6 78.5600237730927 6.9133349323794
## Happy_Planet_Index GDP_capita Population GINI_index
## 1 20.2253497709571 690.842629014956 29726803 Data unavailable
## 2 36.76687443911 4247.48543681502 2900489 28.96
## 3 33.3005427886154 5583.61615950131 37439427 Data unavailable
## 4 35.1902436422946 14357.4115893903 42095224 42.49
## 5 25.6664172884369 3565.5175749254 2978339 30.48
## 6 21.2289663082594 67646.1038529626 22728254 Data unavailable
- Verificando que el dataset se encuentre libre de datos nulos
sum(is.na(hpi_clean))
## [1] 0
4. - PREPROCESAMIENTO DE DATOS
corregir el tipo de dato de cada columna
# Se observa que el tipo de dato en la tabla no son los correctos respecto a su contenido
hpi_clean$HPI_Rank <- as.integer(hpi_clean$HPI_Rank)
hpi_clean$Region <- as.factor(hpi_clean$Region)
hpi_clean$Average_Life_Expectancy <- as.numeric(hpi_clean$Average_Life_Expectancy)
hpi_clean$Average_Wellbeing <- as.numeric(hpi_clean$Average_Wellbeing)
hpi_clean$Happy_Life_Years <- as.numeric(hpi_clean$Happy_Life_Years)
hpi_clean$Footprint <- as.numeric(hpi_clean$Footprint)
hpi_clean$Inequality_of_Outcomes <- as.numeric(hpi_clean$Inequality_of_Outcomes)
hpi_clean$Inequality_adjusted_Life_Expectancy <- as.numeric(hpi_clean$Inequality_adjusted_Life_Expectancy)
hpi_clean$Inequality_adjusted_Wellbeing <- as.numeric(hpi_clean$Inequality_adjusted_Wellbeing)
hpi_clean$Happy_Planet_Index <- as.numeric(hpi_clean$Happy_Planet_Index)
hpi_clean$GDP_capita <- as.numeric(hpi_clean$GDP_capita)
hpi_clean$Population <- as.numeric(hpi_clean$Population)
hpi_clean$GINI_index <- as.factor(hpi_clean$GINI_index)
descripción de los datos
str(hpi_clean)
## 'data.frame': 140 obs. of 14 variables:
## $ HPI_Rank : int 110 13 30 19 73 105 43 8 102 87 ...
## $ Country : chr "Afghanistan" "Albania" "Algeria" "Argentina" ...
## $ Region : Factor w/ 6 levels "Americas","Asia Pacific",..: 4 5 4 1 5 2 3 2 5 3 ...
## $ Average_Life_Expectancy : num 59.7 77.3 74.3 75.9 74.4 ...
## $ Average_Wellbeing : num 3.8 5.5 5.6 6.5 4.3 7.2 7.4 4.7 5.7 6.9 ...
## $ Happy_Life_Years : num 12.4 34.4 30.5 40.2 24 ...
## $ Footprint : num 0.79 2.21 2.12 3.14 2.23 9.31 6.06 0.72 5.09 7.44 ...
## $ Inequality_of_Outcomes : num 0.427 0.165 0.245 0.164 0.217 ...
## $ Inequality_adjusted_Life_Expectancy: num 38.3 69.7 60.5 68.3 66.9 ...
## $ Inequality_adjusted_Wellbeing : num 3.39 5.1 5.2 6.03 3.75 ...
## $ Happy_Planet_Index : num 20.2 36.8 33.3 35.2 25.7 ...
## $ GDP_capita : num 691 4247 5584 14357 3566 ...
## $ Population : num 29726803 2900489 37439427 42095224 2978339 ...
## $ GINI_index : Factor w/ 65 levels "24.74","25.59",..: 65 15 65 50 18 65 18 65 4 13 ...
Mostrando el resumen estadistico
summary(hpi_clean)
## HPI_Rank Country Region
## Min. : 1.00 Length:140 Americas :25
## 1st Qu.: 35.75 Class :character Asia Pacific :21
## Median : 70.50 Mode :character Europe :20
## Mean : 70.50 Middle East and North Africa:14
## 3rd Qu.:105.25 Post-communist :26
## Max. :140.00 Sub Saharan Africa :34
##
## Average_Life_Expectancy Average_Wellbeing Happy_Life_Years Footprint
## Min. :48.91 Min. :2.867 Min. : 8.97 Min. : 0.610
## 1st Qu.:65.04 1st Qu.:4.575 1st Qu.:18.69 1st Qu.: 1.425
## Median :73.50 Median :5.250 Median :29.40 Median : 2.680
## Mean :70.93 Mean :5.408 Mean :30.25 Mean : 3.258
## 3rd Qu.:77.02 3rd Qu.:6.225 3rd Qu.:39.71 3rd Qu.: 4.482
## Max. :83.57 Max. :7.800 Max. :59.32 Max. :15.820
##
## Inequality_of_Outcomes Inequality_adjusted_Life_Expectancy
## Min. :0.04322 Min. :27.32
## 1st Qu.:0.13353 1st Qu.:48.21
## Median :0.21174 Median :63.41
## Mean :0.23291 Mean :60.34
## 3rd Qu.:0.32932 3rd Qu.:72.57
## Max. :0.50734 Max. :81.26
##
## Inequality_adjusted_Wellbeing Happy_Planet_Index GDP_capita
## Min. :2.421 Min. :12.78 Min. : 244.2
## 1st Qu.:4.047 1st Qu.:21.21 1st Qu.: 1628.1
## Median :4.816 Median :26.29 Median : 5691.1
## Mean :4.973 Mean :26.41 Mean : 13911.1
## 3rd Qu.:5.704 3rd Qu.:31.54 3rd Qu.: 15159.1
## Max. :7.625 Max. :44.71 Max. :105447.1
##
## Population GINI_index
## Min. :2.475e+05 Data unavailable:75
## 1st Qu.:4.248e+06 30.48 : 2
## Median :1.065e+07 24.74 : 1
## Mean :4.801e+07 25.59 : 1
## 3rd Qu.:3.343e+07 25.9 : 1
## Max. :1.351e+09 26.01 : 1
## (Other) :59
- Dimensiones del dataset despues de la limpieza
dim(hpi_clean)
## [1] 140 14
Análisis y visualización de los datos
- Graficando el primer grupo de datos
pairs.panels(hpi_clean[,1:6], pch=21,main="Gráfico : Matriz de Dispersión, Histograma y Correlación",bg=c("blue"))
Se observa una alta correlacion positiva con respecto a las variables “Average_life_expectancy”, “Average_wellbeing” y “Happy_life_years”
- Graficando el segundo grupo de datos
pairs.panels(hpi_clean[,7:13], pch=21,main="Gráfico : Matriz de Dispersión, Histograma y Correlación",bg=c("blue"))
Se observa que al relacionar algunas caracteristicas correlaciones negativas, positivas y nula
plot_histogram(hpi_clean,ggtheme = theme_linedraw())
Se observa que algunas caracteristicas presentan distribucion de campana como “Average_wellbeing”, las caracteristicas “Happy_life_years” y “life_expectancy” presentan distribucion bimodal y las caracteristicas “footprint” y “GDP_capita” presentan una distribucion asimétrica positiva
Busqueda de valores atípicos
boxplot(hpi_clean[,3:13],main = "Boxplot del índice del planeta feliz 2016",
boxwex = 0.5,col="purple")
Se observa gran cantidad de valores atipicos en la caracteristica “Population” y “GDP_capita”, para lo cual se realizará el respectivo tratamiento de outliers
Tratamiento de outliers
# creando copia
hpi_copy <- hpi_clean
# "Population"
qrts <- quantile(hpi_copy$Population, probs = c(0.25, 0.75), na.rm = TRUE)
caps <- quantile(hpi_copy$Population, probs = c(.05, .95), na.rm = TRUE)
iqr <- qrts[2]-qrts[1]
h <- 1.5 * iqr
hpi_copy$Population[hpi_copy$Population<qrts[1]-h] <- caps[1]
hpi_copy$Population[hpi_copy$Population>qrts[2]+h] <- caps[2]
# "GDP_capita"
qrts <- quantile(hpi_copy$GDP_capita, probs = c(0.25, 0.75), na.rm = TRUE)
caps <- quantile(hpi_copy$GDP_capita, probs = c(.05, .95), na.rm = TRUE)
iqr <- qrts[2]-qrts[1]
h <- 1.5 * iqr
hpi_copy$GDP_capita[hpi_copy$GDP_capita<qrts[1]-h] <- caps[1]
hpi_copy$GDP_capita[hpi_copy$GDP_capita>qrts[2]+h] <- caps[2]
dim(hpi_copy)
## [1] 140 14
- Se observa que no hubo perdida de datos y se redujo en gran medida los outliers
boxplot(hpi_copy[,3:13],main = "Boxplot del índice del planeta feliz 2016",
boxwex = 0.5,col="purple")
5. - ESCALAR LOS DATOS
- Se procede a normalizar los datos para trabajar en una misma escala
hpi_sin_outliers <- hpi_copy
hpi_scale <- hpi_sin_outliers[, 4:13]
hpi_scale <- scale(hpi_scale)
summary(hpi_scale)
## Average_Life_Expectancy Average_Wellbeing Happy_Life_Years Footprint
## Min. :-2.5153 Min. :-2.2128 Min. :-1.60493 Min. :-1.1493
## 1st Qu.:-0.6729 1st Qu.:-0.7252 1st Qu.:-0.87191 1st Qu.:-0.7955
## Median : 0.2939 Median :-0.1374 Median :-0.06378 Median :-0.2507
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.6968 3rd Qu.: 0.7116 3rd Qu.: 0.71388 3rd Qu.: 0.5317
## Max. : 1.4449 Max. : 2.0831 Max. : 2.19247 Max. : 5.4532
## Inequality_of_Outcomes Inequality_adjusted_Life_Expectancy
## Min. :-1.5692 Min. :-2.2192
## 1st Qu.:-0.8222 1st Qu.:-0.8152
## Median :-0.1751 Median : 0.2060
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.7976 3rd Qu.: 0.8221
## Max. : 2.2702 Max. : 1.4059
## Inequality_adjusted_Wellbeing Happy_Planet_Index GDP_capita
## Min. :-2.1491 Min. :-1.86308 Min. :-0.7659
## 1st Qu.:-0.7795 1st Qu.:-0.71120 1st Qu.:-0.6851
## Median :-0.1317 Median :-0.01653 Median :-0.4476
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.6162 3rd Qu.: 0.70106 3rd Qu.: 0.1058
## Max. : 2.2339 Max. : 2.50110 Max. : 2.2309
## Population
## Min. :-0.66932
## 1st Qu.:-0.58537
## Median :-0.45111
## Mean : 0.00000
## 3rd Qu.: 0.02693
## Max. : 2.59717
Analisis de correlacion
ggcorrplot(round(cor(hpi_scale), 1), type = "lower", lab = T, show.legend = T,tl.srt=45)
- Se observa una correlacion muy baja o nula respecto a las caracteristicas de “population”, esto quiere decir que es una caracteristicas poco relevante para el proceso de clusterizacion ya que no aporta mucha información.
- La caracteristica “inequality_of_outcome” presenta una correlacion negatica fuere respecto a las otras caracteristicas, lo cual indica que cuando una de las variables aumenta, la otra variable disminuye y viceversa.
6. - ANÁLISIS DE COMPONENTES PRINCIPALES (PCA)
hpi_clean.pca <- PCA(hpi_scale, graph=FALSE)
print(hpi_clean.pca)
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 140 individuals, described by 10 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. for the variables"
## 4 "$var$cor" "correlations variables - dimensions"
## 5 "$var$cos2" "cos2 for the variables"
## 6 "$var$contrib" "contributions of the variables"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$call" "summary statistics"
## 12 "$call$centre" "mean of the variables"
## 13 "$call$ecart.type" "standard error of the variables"
## 14 "$call$row.w" "weights for the individuals"
## 15 "$call$col.w" "weights for the variables"
fviz_screeplot(hpi_clean.pca, addlabels = TRUE, ylim = c(0, 75)) +
theme_classic()
Según el gráfico, nos muestra que las 2 primeras dimensiones explica el 81.1% de representatividad de todo el conjunto de datos.
Adicionalmente las 2 primeras dimenciones retienen aproximadamente el 80% de las varianzas contenidas en el conjunto de datos.
Las 5 últimas componentes no superan por separado el 1% de varianza explicada.
NOTA: El método PCA es altamente sensible a outliers
fviz_pca_var(hpi_clean.pca, col.var="contrib",
repel = TRUE
)+scale_colour_viridis_c()
Segun el grafico de influencias se observa que la mayoria de las variables que apuntan a la derecha tienen influencias positivas grandes en la dimension 1 y la variable “inequality_of_outcome” tiene influencia negativa en la dimension 1 y 2
Segun el grafico se observa que la variable “population presenta una ortogonalidad respecto a la mayoria de variables, lo cual quiere decir que es una variable poco relevante”
7. - USO DEL ANÁLISIS DE AGRUPAMIENTO DE PAM
Estimación del número de clusters
Método del codo:
set.seed(2023)
ks <- 2:10
fviz_nbclust(hpi_scale, kmeans, method = "wss")+geom_vline(xintercept = 3, color = "red", linetype = 2)
Método de la silueta:
fviz_nbclust(hpi_scale, kmeans, method = "silhouette")
Método de la estadística de brecha
fviz_nbclust(hpi_scale, kmeans, method = "gap_stat")
La mayoría de las técnicas aplicadas para estimar el número de clusters, sugieren que se debe agrupar en 3 clusters
PROPUESTA DE ANÁLISIS PAM 1: En base a las variables solicitadas como paises por riqueza, desarrollo, emiciones de carbono y felicidad
#hpi_scale2 = hpi_scale[, 1:3]
hpi_scale2 = hpi_scale[, c(1,2,3,8)]
set.seed(2023)
pam <- pam(hpi_scale2, diss=FALSE, k = 3, keep.data=TRUE)
hpi_clean$Country[pam$id.med]
## [1] "Liberia" "Romania" "Germany"
fviz_cluster(pam, stand = FALSE, geom = "point",
ellipse.type = "norm", ggtheme = theme_classic())
**EVALUACION DE ALGORIRMO PAM PROPUESTA 1 *
fviz_silhouette(pam)
## cluster size ave.sil.width
## 1 1 42 0.48
## 2 2 60 0.31
## 3 3 38 0.40
PROPUESTA DE ANÁLISIS PAM 2: En base a la sugerencia de la técnica de PCA
Haciendo uso de las variables con mayor representatividad
hpi_scale2 = hpi_scale[, 1:3]
set.seed(2023)
pam <- pam(hpi_scale2, diss=FALSE, k = 3, keep.data=TRUE)
hpi_clean$Country[pam$id.med]
## [1] "Liberia" "Romania" "United Kingdom"
fviz_cluster(pam, stand = FALSE, geom = "point",
ellipse.type = "norm", ggtheme = theme_classic())
library(plotly)
hpi_clean['Cluster'] <- as.factor(pam$clustering)
plot_ly(hpi_clean, x = ~Average_Life_Expectancy, y = ~Average_Wellbeing, z = ~Happy_Life_Years) %>%
add_markers(color = ~Cluster)
EVALUACION DE ALGORIRMO PAM PROPUESTA 2
fviz_silhouette(pam)
## cluster size ave.sil.width
## 1 1 42 0.50
## 2 2 62 0.43
## 3 3 36 0.55
Respecto a los resultados obtenidos en las evaluaciones, se observa que para la propuesta 1 se obtuvo valores de (0.48, 0.31, 0.40) para cada cluster respectivamente con un promedio final de 0.38, sin embargo, para la propuesta 2 con PCA se obtuvo valores de (0.50, 0.43, 0.51) para cada cluster respectivamente con un promedio final de 0.48
Esto quiere decir que aplicando la tecnica de PCA el agrupamiento mejora significativamente, ya que reduce variables que aportan poca informacion o irrelevantes
EXTRAYENDO INFORMACIÓN DEL CLUSTER 1
cluster1 <- hpi_clean[hpi_clean$Cluster == 1,]
cluster1 <- cluster1[, c("Country", "Average_Life_Expectancy", "Average_Wellbeing", "Footprint", "Happy_Planet_Index", "GDP_capita")]
cluster1$Country
## [1] "Afghanistan" "Benin" "Botswana"
## [4] "Burkina Faso" "Burundi" "Cambodia"
## [7] "Cameroon" "Chad" "Comoros"
## [10] "Cote d'Ivoire" "Djibouti" "Ethiopia"
## [13] "Gabon" "Ghana" "Guinea"
## [16] "Haiti" "India" "Kenya"
## [19] "Lesotho" "Liberia" "Malawi"
## [22] "Mauritania" "Mozambique" "Myanmar"
## [25] "Namibia" "Nepal" "Niger"
## [28] "Nigeria" "Pakistan" "Republic of Congo"
## [31] "Rwanda" "Senegal" "Sierra Leone"
## [34] "South Africa" "Swaziland" "Syria"
## [37] "Tanzania" "Togo" "Uganda"
## [40] "Yemen" "Zambia" "Zimbabwe"
summary(cluster1[, c(2:6)])
## Average_Life_Expectancy Average_Wellbeing Footprint Happy_Planet_Index
## Min. :48.91 Min. :2.867 Min. :0.610 Min. :12.78
## 1st Qu.:56.50 1st Qu.:3.900 1st Qu.:1.030 1st Qu.:16.62
## Median :60.27 Median :4.300 Median :1.225 Median :19.93
## Mean :59.78 Mean :4.269 Mean :1.419 Mean :20.41
## 3rd Qu.:63.34 3rd Qu.:4.696 3rd Qu.:1.498 3rd Qu.:23.01
## Max. :70.39 Max. :5.500 Max. :3.830 Max. :31.50
## GDP_capita
## Min. : 244.2
## 1st Qu.: 668.8
## Median : 996.0
## Mean : 1754.3
## 3rd Qu.: 1552.5
## Max. :10642.4
EXTRAYENDO INFORMACIÓN DEL CLUSTER 2
cluster2 <- hpi_clean[hpi_clean$Cluster == 2,]
cluster2 <- cluster2[, c("Country", "Average_Life_Expectancy", "Average_Wellbeing", "Footprint", "Happy_Planet_Index", "GDP_capita")]
cluster2$Country
## [1] "Albania" "Algeria" "Armenia"
## [4] "Bangladesh" "Belarus" "Belize"
## [7] "Bhutan" "Bolivia" "Bosnia and Herzegovina"
## [10] "Bulgaria" "China" "Colombia"
## [13] "Croatia" "Dominican Republic" "Ecuador"
## [16] "Egypt" "El Salvador" "Estonia"
## [19] "Georgia" "Greece" "Guatemala"
## [22] "Honduras" "Hong Kong" "Hungary"
## [25] "Indonesia" "Iran" "Iraq"
## [28] "Jamaica" "Kazakhstan" "Kyrgyzstan"
## [31] "Latvia" "Lebanon" "Lithuania"
## [34] "Macedonia" "Malaysia" "Mauritius"
## [37] "Mongolia" "Montenegro" "Morocco"
## [40] "Nicaragua" "Palestine" "Paraguay"
## [43] "Peru" "Philippines" "Poland"
## [46] "Portugal" "Romania" "Russia"
## [49] "Serbia" "Slovakia" "Sri Lanka"
## [52] "Suriname" "Tajikistan" "Thailand"
## [55] "Trinidad and Tobago" "Tunisia" "Turkey"
## [58] "Turkmenistan" "Ukraine" "Uzbekistan"
## [61] "Vanuatu" "Vietnam"
summary(cluster2[, c(2:6)])
## Average_Life_Expectancy Average_Wellbeing Footprint Happy_Planet_Index
## Min. :65.30 Min. :4.200 Min. :0.720 Min. :14.27
## 1st Qu.:70.75 1st Qu.:4.800 1st Qu.:1.890 1st Qu.:23.66
## Median :73.94 Median :5.400 Median :2.705 Median :27.43
## Mean :73.22 Mean :5.318 Mean :3.159 Mean :28.22
## 3rd Qu.:75.02 3rd Qu.:5.800 3rd Qu.:3.910 3rd Qu.:34.14
## Max. :83.57 Max. :6.470 Max. :8.823 Max. :40.70
## GDP_capita
## Min. : 858.9
## 1st Qu.: 3404.4
## Median : 5680.8
## Mean : 7545.4
## 3rd Qu.: 9652.5
## Max. :36707.8
EXTRAYENDO INFORMACIÓN DEL CLUSTER 3
cluster3 <- hpi_clean[hpi_clean$Cluster == 3,]
cluster3 <- cluster3[, c("Country", "Average_Life_Expectancy", "Average_Wellbeing", "Footprint", "Happy_Planet_Index", "GDP_capita")]
cluster3$Country
## [1] "Argentina" "Australia"
## [3] "Austria" "Belgium"
## [5] "Brazil" "Canada"
## [7] "Chile" "Costa Rica"
## [9] "Cyprus" "Czech Republic"
## [11] "Denmark" "Finland"
## [13] "France" "Germany"
## [15] "Iceland" "Ireland"
## [17] "Israel" "Italy"
## [19] "Japan" "Luxembourg"
## [21] "Malta" "Mexico"
## [23] "Netherlands" "New Zealand"
## [25] "Norway" "Oman"
## [27] "Panama" "Slovenia"
## [29] "South Korea" "Spain"
## [31] "Sweden" "Switzerland"
## [33] "United Kingdom" "United States of America"
## [35] "Uruguay" "Venezuela"
summary(cluster3[, c(2:6)])
## Average_Life_Expectancy Average_Wellbeing Footprint Happy_Planet_Index
## Min. :73.89 Min. :5.800 Min. : 2.790 Min. :13.15
## 1st Qu.:79.01 1st Qu.:6.475 1st Qu.: 4.322 1st Qu.:27.85
## Median :80.55 Median :6.950 Median : 5.290 Median :30.60
## Mean :79.98 Mean :6.892 Mean : 5.572 Mean :30.30
## 3rd Qu.:81.68 3rd Qu.:7.325 3rd Qu.: 6.100 3rd Qu.:34.33
## Max. :83.24 Max. :7.800 Max. :15.820 Max. :44.71
## GDP_capita
## Min. : 9703
## 1st Qu.: 20792
## Median : 40178
## Mean : 39057
## 3rd Qu.: 49101
## Max. :105447
8. - UN MAPA MUNDIAL DE LOS CLÚSTERES
map <- map_data("world")
map$region[map$region == "USA"] <- "United States of America"
map1 <- left_join(map, hpi_clean[, c("Region", "Country", "Average_Life_Expectancy", "Average_Wellbeing", "Footprint", "Inequality_of_Outcomes", "Happy_Planet_Index", "GDP_capita", "Population", "Cluster")], by = c("region" = "Country"))
HPI_map <- ggplot(map1) +
geom_polygon(aes(x = long, y = lat, group = group, fill = Cluster, colour = Cluster)) +
coord_equal() +
labs(title = "Agrupación Índice Planeta Feliz", subtitle = "", x = NULL, y = NULL) +
theme_linedraw() +
theme(plot.title = element_text(face = "bold"), plot.subtitle = element_text(face = "italic"),
legend.position = "bottom", legend.justification = "center",
legend.title = element_text(face = "bold"),
)
HPI_map
Cluster 1 (bajo): Este grupo lo conforman la mayoria de los paises del continente africano, donde el PBI medio es de 1754.3, con una puntuacion de bienestar de 4.269, una esperanza de vida de 59.78 años, con emisiones de carbono de 1.419 y adicionalmente la felicidad tiene una puntuacion de 20.41
Cluster 2 (medio): Este grupo lo conforman algunos paises de America del sur como Perú, Guatemala, paises de Asia y paises pertenecientes al continente Europeo, donde el PBI medio es de 7545.4, con una puntuacion de bienestar de 5.318, una esperanza de vida de 73.22 años, con emisiones de carbono de 3.159 y adicionalmente la felicidad tiene una puntuacion de 28.22
Cluster 3 (alto): Este grupo lo conforman paises de America del norte y algunos de America del sur como Brazil, Chile, tambien lo conforman paises del continente de Oceania y algunos paises de Europa, donde el PBI medio es de 39057, con una puntuacion de bienestar de 6.892, una esperanza de vida de 79.98 años, con emisiones de carbono de 5.572 y adicionalmente la felicidad tiene una puntuacion de 30.30
Otros: representado por el color gris(no se tiene información)